FPGA architecture having two-level cluster input interconnect scheme without bandwidth limitation

ABSTRACT

An interconnect architecture for a programmable logic device comprises a plurality of interconnect routing lines. The data inputs of a plurality of first-level multiplexers are connected to the plurality of interconnect routing lines such that each interconnect routing line is connected to only one multiplexer. A plurality of second-level multiplexers are organized into multiplexer groups. Each of a plurality of lookup tables is associated with one of the multiplexer groups and has a plurality of lookup table inputs. Each lookup table input is coupled to the output of a different one of the second-level multiplexers in the one of the multiplexer groups with which it is associated. The data inputs of the second-level multiplexers are connected to the outputs of the first-level multiplexers such that each output of each first-level multiplexer is connected to an input of only one second-level multiplexer in each multiplexer group.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/855,974, filed Sep. 14, 2007, now issued as U.S. Pat. No. 7,408,383, which claims priority to U.S. Provisional Patent Application Ser. No. 60/825,872, filed Sep. 15, 2006, both of which are incorporated by reference as if set forth herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to field programmable gate array (FPGA) architectures. More specifically, the invention relates to an area-efficient interconnect scheme for a cluster-based FPGA architecture that connects inter-cluster routing tracks to the inputs of look-up tables (or other logic cells) in the cluster.

2. The Prior Art

A cluster architecture is a type of FPGA architecture in which the basic repeating layout tile is a cluster. The cluster is an aggregation of logic blocks and routing multiplexers. Usually, a limited number of inputs are provided into the cluster in order to save area. A routing multiplexer is a basic FPGA routing element with multiple inputs and one output. It can be programmed to connect one of its inputs to the output. The number of inputs to the routing multiplexer is called the multiplexer size. A crossbar is equivalent to M multiplexers with each multiplexer selecting an output from a subset of N inputs. An N×M crossbar connects N different inputs to M outputs. If the N inputs are drawn as N horizontal wires and the M outputs as M vertical wires, there are N*M crosspoints, with each one representing a possible input-output connection. The number of connections (or switches) in a crossbar is the number of provided connections. A fully populated crossbar has N*M connections. A p% sparsely populated crossbar has N*M*p% connections.
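The connection counts described above can be checked with a short calculation. The following Python sketch is illustrative only; the function name `crossbar_connections` and the sample sizes (16 inputs, 4 outputs) are assumptions chosen for the example and are not taken from any particular device.

```python
def crossbar_connections(n_inputs: int, m_outputs: int, population: float = 1.0) -> int:
    """Return the number of programmable connections in an N x M crossbar.

    `population` is the populated fraction of the N*M crosspoints
    (1.0 means fully populated, 0.5 means 50% sparsely populated).
    """
    if not 0.0 <= population <= 1.0:
        raise ValueError("population must be between 0 and 1")
    return round(n_inputs * m_outputs * population)


# Hypothetical sizes used only for illustration.
full = crossbar_connections(16, 4)       # fully populated: N*M crosspoints
half = crossbar_connections(16, 4, 0.5)  # 50% sparsely populated
```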

A cluster input interconnect scheme is an interconnect network that connects inter-cluster routing tracks to inputs of lookup tables (LUTs) (or other logic cells). It usually consists of multiplexers. Depending on the number of multiplexers that a routing track signal needs to pass through to reach LUT inputs, it could be classified as a one-level scheme or a two-level scheme. Depending on the number of unique signals that may be routed to the LUT inputs simultaneously, it could be classified as “having input bandwidth limitation” or “not having input bandwidth limitation.” Usually, one-level schemes do not have input bandwidth limitation, while two-level schemes exhibit input bandwidth limitation.

A one-level input interconnect scheme is a scheme that connects the routing tracks directly to the logic cells or LUT input multiplexers and usually has no bandwidth limitation. This scheme has been used, for example, in FPGAs available from Xilinx of San Jose, Calif. An illustrative example of such a scheme is shown in FIG. 1. This scheme takes signals from a plurality of T input tracks 10-1 through 10-T. A plurality of M input signals on lines 12-1 through 12-M are programmably connected to the inputs of multiplexers 14-1 through 14-P through an interconnect matrix 16 including programmable interconnect elements. There are numerous kinds of programmable interconnect elements as is known in the art.

The outputs of multiplexers 14-1 through 14-P each feed an input of one of N LUTs identified by reference numerals 18-1 through 18-N. Each of LUTs 18-1 through 18-N has multiple inputs. Let S be the number of inputs of the LUT, or LUT size (for example, S=4 for a 4-input LUT). Therefore, the number of input multiplexers P=S*N (total number of LUT inputs for N LUTs). The number of input signals M<=P*MUX size, since each input signal is allowed to fan out to more than one input MUX. Finally, the number of routing tracks T>=M.

Architectures of the type shown in FIG. 1 are usually not bandwidth limited in that the total number of input signals that are provided is at least equal to or (more often) considerably larger than the total number of LUT input multiplexers; i.e., M>=P.

A two-level input interconnect scheme is a scheme that connects the routing tracks first to inputs of first-level multiplexers. The outputs of the first-level multiplexers are connected to inputs of LUT input multiplexers (or second-level multiplexers). The two-level input interconnect scheme includes first and second stage crossbars.

An example of a two-level input interconnect scheme is shown in FIG. 2. As in the scheme shown in FIG. 1, the two-level interconnect scheme shown in FIG. 2 takes signals from a plurality of T input tracks 10-1 through 10-T. A plurality of M input signals are connected to the inputs of first-level multiplexers 14-1 through 14-10 using an interconnect matrix crossbar 16. Multiplexers 14-1 through 14-10 are shown each having sixteen inputs.

The outputs of the first-level multiplexers 14-1 through 14-10 are connected to the inputs of P (P=16) second-level multiplexers 18-1 through 18-16 using an interconnect matrix crossbar 20. The outputs of multiplexers 18-1 through 18-16 each feed an input of one of N LUTs (N=4) identified by reference numerals 24-1 through 24-4. Each of LUTs 24-1 through 24-4 has S inputs. As in FIG. 1, the number of second-level multiplexers P=S*N. The number of first-level multiplexers 14-1 through 14-10 is K=(S*(N+1)/2). Also, as in FIG. 1, the number of first multiplexer input signals M<=K*MUX size, since each input signal is allowed to fan out to multiple first-level MUXes. And the number of routing tracks T>=M.

Prior-art two-level schemes have bandwidth limitations. The bandwidth limitation comes from the fact that the number of first-level MUXes K (=(S*(N+1)/2)) is smaller than the number of LUT input MUXes P (=S*N), which means that N LUTs (i.e., S*N LUT inputs) have to share at most K unique input signals. The bandwidth limitation is necessary to make the scheme area efficient. There are many publications discussing how large the bandwidth limitation should be. For four-input LUTs, a type of logic block commonly used in FPGAs, the limitation on the number of unique signals going into a cluster simultaneously is generally accepted to be 4*(N+1)/2=2N+2, where N is the number of four-input LUTs in a cluster.

An input bandwidth limitation is the number of unique routing track signals that can be simultaneously routed to the LUT inputs through a cluster input interconnect. A cluster of N LUTs each having S inputs could need S*N unique signals in the worst case. If the number of unique input signals (out of M available to the cluster) that can be simultaneously routed to the LUT inputs is smaller than S*N, then it is said that the cluster (or the cluster input interconnect) has input bandwidth limitation. Otherwise, the cluster (or the cluster input interconnect) has no bandwidth limitation.
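The definition above reduces to a simple comparison, sketched below in Python. The function names are hypothetical, and the integer division in `prior_art_first_level_count` is an assumption for cases where S*(N+1) is odd; the FIG. 2 values (S=4, N=4) are taken from the description.

```python
def prior_art_first_level_count(s: int, n: int) -> int:
    """K = S*(N+1)/2, the first-level MUX count cited for prior-art schemes.

    Integer division is an assumption for the case where S*(N+1) is odd.
    """
    return s * (n + 1) // 2


def has_bandwidth_limitation(max_unique_signals: int, s: int, n: int) -> bool:
    """True if fewer than S*N unique signals can reach the LUT inputs at once."""
    return max_unique_signals < s * n


# FIG. 2 values from the description: S=4 inputs per LUT, N=4 LUTs.
k_prior = prior_art_first_level_count(4, 4)            # 10 first-level MUXes
limited_prior = has_bandwidth_limitation(k_prior, 4, 4)
# A scheme with K = S*N = 16 first-level MUXes is not limited in this sense.
limited_full = has_bandwidth_limitation(16, 4, 4)
```

For four-input LUTs the prior-art count equals 2N+2, so `prior_art_first_level_count(4, n)` returns `2 * n + 2` for any N.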

The bandwidth limit imposes a hard constraint in clustering, i.e., if the number of unique external signals required by the cells in the cluster exceeds the bandwidth limit, the cluster is not routable. Such a scheme has been used in academia (VPR-type architecture). A VPR-type architecture is an FPGA architecture popular in academia that is based on LUT clusters. The cluster input scheme in a VPR-type architecture is a two-level scheme with bandwidth limitation S*(N+1)/2. The first interconnect crossbar is usually sparsely populated, while the second interconnect crossbar is usually assumed to be fully populated, which is very area expensive.

Such a scheme has also been used in FPGAs available from Altera Corp. of San Jose, Calif. Commercial products like the Stratix line of products from Altera use 50% connection population in the second crossbar.

Researchers have studied the depopulation of two-level interconnect schemes by looking into each stage separately. The research has concluded that having K>=S*N first-level MUXes in such a scheme (i.e., no bandwidth limitation, or allowing all LUT inputs to have a unique input signal) is excessive and therefore a waste of resources. On the other hand, at least one article has indicated that an M=K*MUX size depopulation scheme provides poor routability (see Guy Lemieux and David Lewis. Design of Interconnection Networks for Programmable Logic. Kluwer Academic Publishers, 2004 (“Lemieux and Lewis”)).

In the prior art, the Monte Carlo method is used for measuring routability. This method picks a large number of random routing vectors, and measures the percentage of them that can be routed on a routing structure. The obtained percentage measures the routability of the routing structure, and can be used to guide iterative improvement of the connectivity in the routing structure. This method can only be used for a one-level crossbar.
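The Monte Carlo measurement described above may be sketched for a one-level crossbar as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited publication: a routing vector is assumed to be a set of k distinct input signals that must each reach its own output multiplexer, the outputs are assumed to be interchangeable, and the names `routable` and `monte_carlo_routability` are hypothetical. Routability of a single vector is decided by bipartite matching.

```python
import random


def routable(signals, conn):
    """True if every signal can be given its own output mux.

    conn[j] is the set of input indices connectable to output mux j.
    Uses augmenting-path bipartite matching.
    """
    owner = {}  # output index -> signal currently assigned to it

    def place(sig, seen):
        for out, inputs in enumerate(conn):
            if sig in inputs and out not in seen:
                seen.add(out)
                # Take a free output, or displace its owner if that owner
                # can be moved to another output.
                if out not in owner or place(owner[out], seen):
                    owner[out] = sig
                    return True
        return False

    return all(place(sig, set()) for sig in signals)


def monte_carlo_routability(conn, n_inputs, k, trials=1000, seed=0):
    """Fraction of random k-signal routing vectors that can be routed."""
    rng = random.Random(seed)
    ok = sum(routable(rng.sample(range(n_inputs), k), conn) for _ in range(trials))
    return ok / trials


# Hypothetical crossbars with 4 inputs: fully populated and sparse.
full_xbar = [set(range(4)) for _ in range(3)]
sparse_xbar = [{0, 1}, {2, 3}]
r_full = monte_carlo_routability(full_xbar, n_inputs=4, k=3)
r_sparse = monte_carlo_routability(sparse_xbar, n_inputs=4, k=2)
```

In the sparse example, signals 0 and 1 compete for the same output, so a vector containing both cannot be routed, and the measured routability falls below that of the fully populated crossbar.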

BRIEF DESCRIPTION OF THE INVENTION

An interconnect architecture for a programmable logic device comprises a plurality of interconnect routing lines. The data inputs of a plurality of first-level multiplexers are connected to the plurality of interconnect routing lines such that each interconnect routing line is connected to only one multiplexer. A plurality of second-level multiplexers are organized into multiplexer groups. Each of a plurality of lookup tables is associated with one of the multiplexer groups and has a plurality of lookup table inputs. Each lookup table input is coupled to the output of a different one of the second-level multiplexers in the one of the multiplexer groups with which it is associated. The data inputs of the second-level multiplexers are connected to the outputs of the first-level multiplexers such that each output of each first-level multiplexer is connected to an input of only one second-level multiplexer in each multiplexer group.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

FIG. 1 is a block diagram of a single-level prior-art interconnect scheme.

FIG. 2 is a block diagram of a two-level prior-art interconnect scheme.

FIG. 3 is a block diagram of an illustrative two-level interconnect scheme according to the present invention.

FIG. 4 is a block diagram of another illustrative two-level interconnect scheme according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Persons of ordinary skill in the art will realize that the following description of the present invention is illustrative only and not in any way limiting. Other embodiments of the invention will readily suggest themselves to such skilled persons.

An illustrative embodiment of the present invention as shown in FIG. 3 includes an interconnect scheme for routing tracks to inputs of logic cells in a cluster-based programmable logic architecture that has two multiplexer levels. The interconnect scheme of the present invention has no bandwidth limitation; that is, a unique signal can be brought in for every logic cell input. In another embodiment the architecture may have a bandwidth limitation that is greatly reduced compared to the prior art.

Referring now to FIG. 3, a block diagram shows an illustrative example of a two-level input interconnect scheme according to the present invention. As in the schemes shown in FIGS. 1 and 2, the two-level interconnect scheme takes signals from a plurality of T input tracks 30-1 through 30-T. A plurality of M (M=80) input signals are connected to the inputs of K (K=16) first-level multiplexers 32-1 through 32-16 using an interconnect matrix crossbar 34.

The outputs of the first-level multiplexers 32-1 through 32-16 are connectable to the inputs of P (P=16) second-level multiplexers 36-1 through 36-16 using an interconnect matrix crossbar 38. The outputs of multiplexers 36-1 through 36-16 each feed an input of one of N (N=4) LUTs identified by reference numerals 40-1 through 40-4. Each of LUTs 40-1 through 40-4 has S (S=4) inputs.

As in FIGS. 1 and 2, the number of second-level multiplexers P=S*N. Also, as in FIG. 2, the number of first multiplexer input signals M<=K*MUX size. The number of input tracks T>=M.

While FIG. 3 appears to be similar to FIG. 2, there are key differences that result in the advantages provided by the present invention. The number K of first-level multiplexers 32-1 through 32-16 is K>=(S*N).

There is no bandwidth limitation in the scheme shown in FIG. 3 because the number of first-level multiplexers 32-1 through 32-16 is at least as large as the number of input multiplexers P=S*N. This allows up to K input signals to be coupled to the second-level multiplexers, which, in turn, allows each of the S*N LUT inputs to be connected to a unique input signal. In another embodiment, the number of first-level multiplexers is approximately the number of input multiplexers.

Area efficiency is achieved by depopulation in the interconnects between routing tracks and the first-level multiplexers (the first stage interconnect) as well as between the first and second-level multiplexers (the second stage interconnect). One of many possible depopulation schemes could be used in each stage as described in more detail herein. One important element of an embodiment of the invention is to determine the depopulation schemes and the parameters for the depopulation schemes for the first and second stages in conjunction with each other to assure an efficient architecture with little or no bandwidth limitation. The two sets of routing interconnects are jointly designed to implement the desired connectivity efficiently, i.e., the depopulation schemes are jointly optimized to minimize area and maximize routability.

One way to depopulate the first stage interconnect is to have just M switches in the interconnect (so the population is 1/K). Each of the M inputs is connected to just one of the first-level MUXes. This is the sparsest depopulation one can do if one still wants all M input signals to be connectable. In this case, an M=K*MUX size depopulation scheme is able to be employed, although it was not considered to be usable in the prior art (see Lemieux and Lewis). The second stage interconnect is depopulated to a population of 1/S by partitioning the first stage MUX outputs into S subgroups, with each subgroup driving one input MUX (out of S) of each LUT.
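The two depopulation rules described above can be sketched as a small connectivity builder. The Python below is a minimal illustration using the FIG. 3 values (M=80, K=16, S=4, N=4). The function name `build_two_level`, the modulo assignment of tracks to first-level MUXes, and the contiguous subgroup partition are assumptions made for the example; the description only requires that each input drive one first-level MUX and that each subgroup drive one input MUX of each LUT.

```python
def build_two_level(m, k, s, n):
    """Build stage-1 and stage-2 connectivity for a depopulated two-level scheme.

    stage1[j]       -> list of input tracks feeding first-level mux j
    stage2[(lut,g)] -> list of first-level muxes feeding input mux g of LUT lut
    """
    # Stage 1: each input track feeds exactly one first-level mux (population 1/K).
    stage1 = [[i for i in range(m) if i % k == j] for j in range(k)]
    # Stage 2: first-level outputs are split into S subgroups; subgroup g
    # feeds input mux g of every LUT (population 1/S).
    group_size = k // s
    subgroups = [list(range(g * group_size, (g + 1) * group_size)) for g in range(s)]
    stage2 = {(lut, g): subgroups[g] for lut in range(n) for g in range(s)}
    return stage1, stage2


stage1, stage2 = build_two_level(m=80, k=16, s=4, n=4)
stage1_switches = sum(len(tracks) for tracks in stage1)
stage2_switches = sum(len(srcs) for srcs in stage2.values())

# One witness assignment: input mux g of LUT `lut` selects member `lut` of
# subgroup g, so all S*N LUT inputs select different first-level muxes.
choice = {(lut, g): stage2[(lut, g)][lut] for (lut, g) in stage2}
unique_sources = len(set(choice.values()))
```

With these values the first stage uses 80 switches (five tracks per first-level MUX) and the second stage uses 64 switches, which is 1/S of the K*S*N=256 possible connections. The witness assignment shows that sixteen different first-level outputs can reach the sixteen LUT inputs at once; it does not by itself show that every set of sixteen tracks is routable.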

The present invention provides an advantage over a one-level scheme in that it achieves better routability with a smaller number of switches. It can be used in large clusters where a one-level scheme would be too inefficient. The present invention also provides an advantage over a prior-art two-level scheme in that it does not have bandwidth limitation, thus software (e.g., place and route software) is free from such constraint. With a higher number of first-level multiplexers, aggressive depopulation of both the first crossbar and the second crossbar may be implemented while still achieving good routability.

An illustrative way to build the two-level interconnect of the present invention is presented herein. However, other schemes and enhancements are also possible. For example, both crossbars may be more populated than what is shown in FIG. 3. The crossbar between the first-level and second-level multiplexers may be populated up to about twice what is shown in FIG. 3, along with increasing the number of inputs in the second-level multiplexers. If the population of the second crossbar was doubled, the number of inputs to the second-level multiplexers would also be doubled. Likewise, the population of the first crossbar may also be increased up to about twice what is shown in FIG. 3, in a manner similar to that discussed for the second crossbar, with a corresponding increase in the number of inputs to the first-level multiplexers. In addition, if the cluster is large, two crossbars may be used in parallel. Further, one or more input signals may be configured to bypass the first level and go directly to the second level to provide faster timing.

One example for a four-input LUT-based cluster is N=8 with M=160 and K=32. In the first stage interconnect each first-level multiplexer takes 5 routing tracks; i.e., each routing track drives only one first-level multiplexer. The number of connections between routing tracks and first-level multiplexers is 160. The population is 1/K (only M connections out of M*K maximal possible connections).

In the second stage interconnect the population from first-level multiplexers to LUT input multiplexers is 25%, i.e., the number of potential connections is 25% of the maximal possible value K*N*4. To be specific, each first-level multiplexer drives eight LUT input multiplexers (one for each LUT).
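The populations quoted for this example follow directly from the parameters. The short Python check below is illustrative only; the variable names are assumptions, and the values M=160, K=32, N=8 and the LUT size of 4 are taken from the example above.

```python
M, K, S, N = 160, 32, 4, 8

# First stage: each routing track drives exactly one of the K multiplexers
# fed by the routing tracks, so there are M connections.
tracks_per_mux = M // K
stage1_connections = M
stage1_population = stage1_connections / (M * K)   # equals 1/K

# Second stage: each of the K outputs drives one LUT input multiplexer
# per LUT, out of K*N*S possible connections.
stage2_connections = K * N
stage2_population = stage2_connections / (K * N * S)

total_connections = stage1_connections + stage2_connections
```

Under these assumptions the second stage has 256 connections, and the whole input interconnect has 416.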

Contrary to the prior-art assumptions and approaches, experiments have shown that the above structure exhibits good routability despite the depopulation, and its overall connection count is smaller than that of even depopulated VPR-type architectures.

FIG. 4 is another example that employs two such interconnect structures in parallel when the cluster is large. An illustrative example is a cluster of sixteen 4-input LUTs with 256 input routing tracks (M=256, S=4, N=16). There are two ways to form such a structure. First, a single structure using sixty-four first-level multiplexers (to guarantee no bandwidth limitation) could be employed.

In the alternative, two parallel structures could be employed, each serving half the LUTs. Such a structure is shown in FIG. 4. The M inputs each have two fan outs, one to section 50-1, and the other to section 50-2. Sections 50-1 and 50-2 are identical, except that section 50-1 drives LUTs 52-1 through 52-8 and section 50-2 drives LUTs 52-9 through 52-16. The arrangement of FIG. 4 has no bandwidth limitation. Each of sections 50-1 and 50-2 has thirty-two first-level multiplexers to guarantee no bandwidth limitation. Both crossbars take the same set of M inputs, but each only drives half of the LUTs. Each crossbar is built using the approach described above with reference to FIG. 3.

The advantage of FIG. 4 mainly applies to large clusters. It is achieved by balancing the sizes between first-level and second-level multiplexers. Suppose it is desired to build an interconnect for a cluster of sixteen 4-input LUTs with 256 incoming signals, then N=16, M=256, and K=64. If the approach of FIG. 3 is used, it would require sixty-four first-level multiplexers (each with size 256/64=4), and sixty-four second-level multiplexers (each with size 64/4=16). This results in a total of 1,280 connections (64×4+64×16). There is an imbalance between the size of first-level multiplexers (which is 4) and that of second-level multiplexers (which is 16). With the approach of FIG. 4, two parallel sub-interconnects are used. Each sub-interconnect has thirty-two first-level multiplexers (each with 8 inputs), and thirty-two second-level multiplexers (each with 8 inputs). The number of connections in one sub-interconnect would be 512 (32×8+32×8). So the total number of connections combined is 1,024 (512×2). The size of first-level and second-level multiplexers is perfectly balanced (both have 8 inputs). This saves 256 connections (20% compared with the approach of FIG. 3), while routability remains almost the same (using entropy measurement, the entropy of an interconnect built using the FIG. 4 approach is only 3% smaller than using the approach of FIG. 3). Entropy is discussed in Wenyi Feng and Sinan Kaptanoglu. Designing efficient input interconnect blocks for LUT clusters using counting and entropy. FPGA 2007, Feb. 18-20, 2007, Monterey, Calif. This article is incorporated herein by reference.

The area efficiency of the two alternative approaches may be calculated. If the structure is formed as a whole, with sixty-four first-level multiplexers (each first-level multiplexer drives 16 loads), the total number of switches used is 256+64×16=256+1,024=1,280; the entropy of the whole unit is 346.88; and the entropy per switch is 0.271. If the structure is built as two sub-units, each with 32 first-level multiplexers, the total number of switches used is 256×2+64×8=1,024; the entropy of the entire unit is 336.77; and the entropy per switch is 0.329. It may be seen that the area efficiency of the second alternative is better than that of the first (0.329 vs. 0.271).
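The switch counts and entropy-per-switch figures above can be reproduced with a few lines of arithmetic. In the Python sketch below, the function name `switch_count` and its `sections` parameter are assumptions made for illustration; the entropy values 346.88 and 336.77 are simply the figures quoted above and are not computed here.

```python
def switch_count(m, k, n, sections=1):
    """Total switches when the interconnect is split into parallel sections.

    Each section sees all M inputs (one switch per input) and holds K/sections
    first-level muxes, each driving one input mux of each of its N/sections LUTs.
    """
    k_per_section = k // sections
    luts_per_section = n // sections
    stage1 = m
    stage2 = k_per_section * luts_per_section
    return sections * (stage1 + stage2)


whole = switch_count(m=256, k=64, n=16, sections=1)   # single structure
split = switch_count(m=256, k=64, n=16, sections=2)   # two parallel sub-units
saving = (whole - split) / whole

# Entropy values quoted in the description (not derived in this sketch).
entropy_per_switch_whole = 346.88 / whole
entropy_per_switch_split = 336.77 / split
entropy_loss = 1 - 336.77 / 346.88
```

Under these assumptions the split structure uses 256 fewer switches (a 20% saving) for an entropy loss of about 3%, which is why its entropy per switch is higher.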

In general, when cluster size is large, it is not efficient to have each first-level multiplexer fan out to all LUTs, because the second-level multiplexers will become too large, reducing area efficiency. With the approach disclosed herein, the area efficiency of different implementations can be computed and compared.

While embodiments and applications of this invention have been shown and described, it would be apparent to those skilled in the art that many more modifications than mentioned above are possible without departing from the inventive concepts herein. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

1. An interconnect architecture for a programmable logic device comprising: a plurality of interconnect routing lines; a plurality of first-level multiplexers each having data inputs and a data output, the data inputs of the first-level multiplexers connected to the plurality of interconnect routing lines such that each interconnect routing line is connected to at least two of the plurality of level one multiplexers; a plurality of second-level multiplexers each having data inputs and a data output, the second-level multiplexers organized into multiplexer groups; and a plurality of lookup tables, each lookup table associated with one of the multiplexer groups and having a plurality of lookup table inputs, each lookup table input coupled to the output of a different one of the second-level multiplexers in the one of the multiplexer groups with which it is associated; wherein the data inputs of the second-level multiplexers are connected to the outputs of the first-level multiplexers such that each output of each first-level multiplexer is connected to an input of only one second-level multiplexer in each multiplexer group.
2. The interconnect architecture of claim 1 which further comprises: an interconnect matrix crossbar coupling data inputs from the plurality of second level multiplexers to data outputs of the plurality of first level multiplexers.
3. An interconnect architecture for a programmable logic device comprising: a plurality of interconnect routing lines; a plurality of first-level multiplexers each having data inputs and a data output, the data inputs of the first-level multiplexers connected to the plurality of interconnect routing lines such that each of the plurality of interconnect routing lines is connected to at least two of the plurality of level one multiplexers; a plurality of second-level multiplexers each having data inputs and a data output, the second-level multiplexers organized into multiplexer groups; and a plurality of lookup tables, each lookup table associated with one of the multiplexer groups and having a plurality of lookup table inputs, each lookup table input coupled to the output of a different one of the second-level multiplexers in the one of the multiplexer groups with which it is associated; wherein the number of first-level multiplexers is equal to at least the number of second-level multiplexers.
4. The interconnect architecture of claim 3 which further comprises: an interconnect matrix crossbar coupling data inputs from the plurality of second level multiplexers to data outputs of the plurality of first level multiplexers.
5. An interconnect architecture for a programmable logic device comprising: a plurality of interconnect lines; a plurality of first-level multiplexers, each having a plurality of inputs and an output; a plurality of second-level multiplexers, each having a plurality of inputs and an output; and a plurality of look up tables, each having a plurality of inputs; and wherein the plurality of first-level multiplexers is organized into groups, the plurality of second-level multiplexers is organized into groups, each group corresponding to a look up table, each input of each look up table is connected to the output of a second-level multiplexer, a first input of a first multiplexer in each group of second-level multiplexers is connected to the output of a first multiplexer in a first group of first-level multiplexers, and each input of each first level multiplexer is connected to one of the plurality of interconnect lines such that each of the plurality of interconnect routing lines is connected to at least two of the plurality of level one multiplexers.
6. The interconnect architecture of claim 5 which further comprises: an interconnect matrix crossbar coupling data inputs from the plurality of second level multiplexers to data outputs of the plurality of first level multiplexers.